# Multimodal Question Answering

## docscopeOCR-7B-050425-exp
- **Author:** prithivMLmods
- **License:** Apache-2.0
- **Tags:** Image-to-Text · Transformers · Multilingual

docscopeOCR-7B-050425-exp is fine-tuned from Qwen/Qwen2.5-VL-7B-Instruct and focuses on document-level OCR, long-context vision-language understanding, and accurate conversion of mathematical content in images to LaTeX.
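Since the model is fine-tuned from Qwen2.5-VL, it can presumably be loaded through the standard Qwen2.5-VL path in recent transformers releases (4.49+). The sketch below illustrates that pattern; the repo id `prithivMLmods/docscopeOCR-7B-050425-exp`, the image path, and the prompt are assumptions for illustration, not confirmed by this listing.

```python
# Minimal sketch: document OCR with a Qwen2.5-VL-based checkpoint.
# Assumes the checkpoint loads via the standard Qwen2.5-VL classes
# and that qwen_vl_utils is installed.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "prithivMLmods/docscopeOCR-7B-050425-exp"  # assumed repo id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "page.png"},  # placeholder document image
        {"type": "text", "text": "Transcribe this page; render math as LaTeX."},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=1024)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```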
## VideoChat-R1_7B_caption
- **Author:** OpenGVLab
- **License:** Apache-2.0
- **Tags:** Video-to-Text · Transformers · English

VideoChat-R1_7B_caption is a multimodal video-to-text generation model built on Qwen2-VL-7B-Instruct, focusing on video content understanding and caption generation.
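Because this model is built on Qwen2-VL-7B-Instruct, a reasonable assumption is that it follows Qwen2-VL's standard video chat interface; the sketch below shows that pattern, with the repo id, video path, and prompt as placeholders.

```python
# Sketch: video captioning via the standard Qwen2-VL interface,
# assuming the checkpoint is loadable with these classes.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "OpenGVLab/VideoChat-R1_7B_caption"  # assumed repo id
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "clip.mp4"},  # placeholder video file
        {"type": "text", "text": "Describe what happens in this video."},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```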
## ViCA-7B
- **Author:** nkkbr
- **License:** Apache-2.0
- **Tags:** Video-to-Text · Transformers · English

ViCA-7B is a vision-language model fine-tuned specifically for visuospatial reasoning in indoor video environments. Built on the LLaVA-Video-7B-Qwen2 architecture and trained on the ViCA-322K dataset, it emphasizes structured spatial annotation and instruction-based complex reasoning tasks.
## DeepSeer-R1-Vision-Distill-Qwen-1.5B (google/vit-base-patch16-224)
- **Author:** mehmetkeremturkcan
- **License:** Apache-2.0
- **Tags:** Image-to-Text · Transformers

DeepSeer is a vision-language model built on DeepSeek-R1 that supports chain-of-thought reasoning and is trained with dialogue templates adapted for vision models.
## VideoRefer-7B
- **Author:** DAMO-NLP-SG
- **License:** Apache-2.0
- **Tags:** Video-to-Text · Transformers · English

VideoRefer-7B is a multimodal large language model focused on video question answering, capable of understanding and analyzing spatiotemporal object relationships in videos.
## LLaVA-SpaceSGG
- **Author:** wumengyangok
- **License:** Apache-2.0
- **Tags:** Image-to-Text · English

LLaVA-SpaceSGG is a visual question-answering model based on LLaVA-v1.5-13B that focuses on scene graph generation: it understands image content and produces structured scene descriptions.
## LongVU-Qwen2-7B
- **Author:** Vision-CAIR
- **License:** Apache-2.0
- **Tags:** Video-to-Text

LongVU is a multimodal model based on Qwen2-7B that focuses on long-video language understanding and employs spatiotemporal adaptive compression.
## Table-LLaVA-v1.5-7B
- **Author:** SpursgoZmy
- **Tags:** Image-to-Text · Transformers · English

Table LLaVA 7B is an open-source multimodal chatbot designed for understanding diverse table images and performing a range of table-related tasks.
## Monkey-Chat
- **Author:** echo840
- **Tags:** Image-to-Text · Transformers

Monkey is a large multimodal model that excels at a variety of visual tasks by increasing input image resolution and improving text-labeling methods.
## InstructBLIP-Vicuna-13B
- **Author:** Salesforce
- **License:** Other
- **Tags:** Image-to-Text · Transformers · English

InstructBLIP is the vision-instruction-tuned version of BLIP-2; this variant builds on the Vicuna-13B language model and is designed for vision-language tasks.
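InstructBLIP has dedicated classes in transformers, so this checkpoint can be used directly as sketched below; the image file and question are placeholders for illustration.

```python
# Instruction-following image QA with InstructBLIP (Vicuna-13B variant),
# using the dedicated transformers classes.
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

model_id = "Salesforce/instructblip-vicuna-13b"
processor = InstructBlipProcessor.from_pretrained(model_id)
model = InstructBlipForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("photo.jpg").convert("RGB")  # placeholder image
inputs = processor(images=image,
                   text="What is unusual about this image?",
                   return_tensors="pt").to(model.device, torch.float16)
out = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```

The same classes also serve the Flan-T5 variant listed next; only the checkpoint id changes.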
## InstructBLIP-Flan-T5-XXL
- **Author:** Salesforce
- **License:** MIT
- **Tags:** Image-to-Text · Transformers · English

InstructBLIP is the vision-instruction-tuned version of BLIP-2, capable of generating descriptions or answers from images and text instructions.
## VideoBLIP-Flan-T5-XL-Ego4D
- **Author:** kpyu
- **License:** MIT
- **Tags:** Video-to-Text · Transformers · English

VideoBLIP is an extension of BLIP-2 capable of processing video data, using Flan-T5-XL as the backbone language model.